An Efficient OCR Error Correction Method for Japanese Text Recognition

Authors

  • Toru Hisamitsu
  • Katsumi Marukawa
  • Yoshihiro Shima
  • Hiromichi Fujisawa
  • Yoshihiko Nitta
Abstract

OCR error correction using Japanese morphological analysis involves two time-consuming procedures: extraction of candidate words from combinations of candidate characters, and finding the most plausible word sequence among combinations of the candidate words. In this paper, an optimal word extraction technique and the use of lexical entries tailored for Japanese verb inflection are investigated and developed. Compared to a standard method, the new method requires 84% less computation and captures 2.6% more candidate words. The new design of lexical entries reduces the chart parsing computation by 20%. The error correction rate of the system is 86.9%, which is 19.6% higher than that of the standard one.

1 Introduction

In ordinary written Japanese sentences, words are not separated by spaces. Thus, error correction algorithms for Japanese text recognition that use Japanese morphological analysis are forced to use two time-consuming but indispensable procedures:

(1) Searching for lexical entries in combinations of characters arranged in a "candidate character lattice" (a sequence of lists of candidate characters output by a character recognition module) and extracting candidate words from a dictionary at each position in the output character string.
(2) Finding the most plausible word sequence from the "candidate word lattice" (the set of words extracted by procedure (1)).

These procedures take most of the error correction time. Procedure (1) consumes more than half of the processing time [1]. Thus, a more efficient word extraction method would be helpful. At the same time, word extraction must be precise enough that as many words as possible are found, because missing words cause linguistic post-processing to fail, which in turn lowers the error correction rate.

This paper investigates an optimal word extraction technique and the use of lexical entries specially tailored for Japanese verb inflection. Compared to the standard method Extl(cl) (mentioned in section 2), this method requires significantly less computation and attains a higher error correction rate even if the recognition rate of the first candidate character is quite low. Comparative experimental results are also shown.
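The two procedures described in the introduction can be illustrated with a minimal Python sketch. This is not the paper's optimized extraction technique or its chart parser: it is a naive baseline under stated assumptions. The trie dictionary, the use of candidate rank as a cost, the per-word penalty, and all names (`build_trie`, `extract_words`, `best_sequence`) are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Optional, Tuple

END = ""  # end-of-word marker; never collides with a one-character key

def build_trie(words: List[str]) -> Dict:
    """Build a nested-dict trie from a word list (toy dictionary)."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def extract_words(lattice: List[List[str]], trie: Dict) -> List[Tuple[int, int, str, int]]:
    """Procedure (1): find every dictionary word spelled by some path
    through the candidate character lattice.
    Returns (start, end, word, cost); cost is the sum of candidate ranks
    (an assumed stand-in for recognition confidence)."""
    found: List[Tuple[int, int, str, int]] = []

    def walk(start: int, pos: int, node: Dict, prefix: str, cost: int) -> None:
        if END in node and prefix:
            found.append((start, pos, prefix, cost))
        if pos == len(lattice):
            return
        for rank, ch in enumerate(lattice[pos]):
            child = node.get(ch)
            if child is not None:
                walk(start, pos + 1, child, prefix + ch, cost + rank)

    for start in range(len(lattice)):
        walk(start, start, trie, "", 0)
    return found

def best_sequence(n: int, words: List[Tuple[int, int, str, int]],
                  word_penalty: int = 1) -> Optional[List[str]]:
    """Procedure (2): dynamic programming over the candidate word lattice.
    Picks the word sequence covering positions 0..n with the lowest total
    cost (assumed scoring: rank cost plus a fixed penalty per word)."""
    inf = float("inf")
    best = [inf] * (n + 1)
    back: List[Optional[Tuple[int, str]]] = [None] * (n + 1)
    best[0] = 0
    by_start: Dict[int, List[Tuple[int, str, int]]] = {}
    for start, end, word, cost in words:
        by_start.setdefault(start, []).append((end, word, cost))
    for start in range(n):
        if best[start] == inf:
            continue
        for end, word, cost in by_start.get(start, []):
            total = best[start] + cost + word_penalty
            if total < best[end]:
                best[end] = total
                back[end] = (start, word)
    if best[n] == inf:
        return None  # no word sequence covers the whole string
    result: List[str] = []
    pos = n
    while pos > 0:
        start, word = back[pos]
        result.append(word)
        pos = start
    return result[::-1]

# Toy input: the first-ranked candidate at position 0 is a misrecognition.
dictionary = ["東京", "京都", "東京都", "都", "に", "行く"]
lattice = [["束", "東"], ["京"], ["都"], ["に"], ["行"], ["く", "<"]]
candidates = extract_words(lattice, build_trie(dictionary))
corrected = best_sequence(len(lattice), candidates)
print(corrected)
```

In this toy input the top-ranked character at the first position is wrong, yet the dictionary lookup still recovers words that use the second-ranked candidate, and the dynamic program selects a full segmentation. The sketch also shows why a missing candidate word is costly: if no extracted word covers some position, `best_sequence` returns `None` and no correction is possible.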

Related articles

An Evaluation of a Method to Detect and Correct Erroneous Characters in Japanese input through an OCR using Markov Models

The "Selective Error Correction Method", which judges these three types of errors and corrects them using an m-th order Markov chain model for Japanese 'kanji-kana' characters, has been proposed and shown to be useful for detecting and correcting randomly generated errors (Araki et al., 1994). In this paper, this method is applied to detect and correct erroneous characters in Japanese text input through an...

Linguistic Error Correction Of Japanese Sentences

This paper describes a newly developed linguistic error correction system, which can correct errors and rejections of Japanese sentences by using linguistic knowledge. Conventional optical character readers (OCR) need human assistance to correct their recognition errors and rejections. An operator must teach the OCR correct answers whenever an illegible character pattern occurs. If this error c...

OCR Post-Processing Error Correction Algorithm Using Google's Online Spelling Suggestion

With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition, was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect as it occa...

Context-Based Spelling Correction for Japanese OCR

We present a novel spelling correction method for those languages that have no delimiter between words, such as Japanese, Chinese, and Thai. It consists of an approximate word matching method and an N-best word segmentation algorithm using a statistical language model. For OCR errors, the proposed word-based correction method outperforms the conventional character-based correction me...

OCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion

With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition, was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect as it occa...

Publication date: 1994